I.   Introduction


In response to the multitude of econometric models which have been
proposed to explain various economic data, a sizable literature on the
problem of model selection has developed in recent years. In particular,
a number of authors have proposed that the model with the largest
Kullback-Leibler information (KLI) be preferred to other models
(Akaike (1973), Sawa (1978), Amemiya (1980), Chow (1980), White (1980)).
These authors have also proposed estimates of the KLI of a model. These
estimates are obtained using the data that were previously used to
obtain maximum likelihood estimates of the parameters of the model. The
main problem with these estimates of the information of a model is that
they result from specific assumptions about the relationship between
the true (unknown) model and the estimated model. In particular,
Amemiya (1980) makes three a priori equally desirable assumptions about
this relationship and obtains three different estimates of the
Kullback-Leibler information. This arbitrariness of the information
criterion for model selection can be overcome by using out-of-sample
data in the computation of an estimate of the KLI. This note provides
such an estimate.

It must be noted that this proposed estimate is always unbiased. By
contrast, each of the estimates of the KLI proposed in the literature is
unbiased only when the relationship between the estimated and the true
model that gives rise to it is, in fact, correct.

Knowing that a statistic tends to choose the best model is often not
enough. It is also desirable to know whether two competing models are
significantly different in terms of their quality. This note provides a
family of tests related to this question. In particular, three null
hypotheses are shown to be testable: that two models have the same KLI,
that they have the same mean squared prediction error, and that they
have the same mean absolute prediction error. It must be noted that the
mean squared error criterion for model selection has recently been
advocated by Dutta (1980), who proved that the true model has, on
average, a lower mean squared error.

The tests proposed here rely on a strong hypothesis about the time
independence of the relative quality of the two different models. I will
argue, however, that this independence is necessary to make the
information criteria for model selection at all desirable. A somewhat
stronger version of this independence is assumed by White (1980), who
proposed a test of the equality of the KLI of two models using in-sample
data. The tests proposed in this note are close relatives of his test.

In section II I discuss the information criteria for model selection
and the test of the hypothesis that two models have the same information.
In section III I present the tests based on generalizations of the mean
squared forecasting error and of the mean absolute forecasting error. In
section IV these tests are shown to be applicable to models with
time-varying parameters.

II.   Information Criteria

Let the vector $Y_t$, $t = 0, 1, \ldots, N$, follow a stochastic
process whose probability density function is $g(Y_t | X_t)$, where
$X_t$ is a vector of variables predetermined at t. Consider two models
of the stochastic process followed by $Y_t$. These models are given by
the probability density functions $f(Y_t | Z_t, \theta)$ and
$h(Y_t | V_t, \gamma)$, where $Z_t$ and $V_t$ are vectors predetermined
at t while $\theta$ and $\gamma$ are vectors of parameters. The
Kullback-Leibler information of the model f about the true model g at t
is given by:

$$ I(f,t) = \int_{-\infty}^{\infty} \log\left[ \frac{f(Y_t | Z_t, \theta)}{g(Y_t | X_t)} \right] g(Y_t | X_t) \, dY_t \qquad (1) $$

It is clear that $I(f,t) \le 0$, where the equality holds only if
$f(Y_t | Z_t, \theta)$ is equal to $g(Y_t | X_t)$ almost everywhere
(c.f. Rao (1973) pp. 58-59). Therefore $I(f,t)$ is a measure of how good
the model f is. In particular, a model whose information is larger than
the information of another model mimics the stochastic process of $Y_t$
better and is therefore a more useful forecasting device. Defining
$I(h,t)$ as in (1), I now define the difference in information between
the models f and h at t by $J(f,h,t)$, which is given by:

$$ J(f,h,t) = \int_{-\infty}^{\infty} \log\left[ \frac{f(Y_t | Z_t, \theta)}{h(Y_t | V_t, \gamma)} \right] g(Y_t | X_t) \, dY_t \qquad (2) $$

Similarly one can in general define the difference in information
between models f and h about the realizations of Y between 0 and N as:

$$ J(f,h) = \sum_{t=0}^{N} \int_{-\infty}^{\infty} \log\left[ \frac{f(Y_t | Z_t, \theta)}{h(Y_t | V_t, \gamma)} \right] g(Y_t | X_t) \, dY_t \qquad (3) $$
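As a numerical illustration of definitions (1) and (2), the sketch below evaluates the information of two candidate normal densities about a standard normal truth by midpoint integration. The three densities, their parameter values and the helper names `normal_logpdf` and `information` are assumptions made for this illustration only; nothing in the text restricts g, f or h to the normal family.

```python
import math

def normal_logpdf(y, mean, sd):
    """Log density of a normal variate with the given mean and standard deviation."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (y - mean) ** 2 / (2 * sd * sd)

def information(model, truth, lo=-12.0, hi=12.0, steps=20000):
    """Approximate I(model) = integral of log(model / truth) * truth, as in (1).

    `model` and `truth` are (mean, sd) pairs; the integral is computed with
    the midpoint rule on [lo, hi], outside of which the truth has negligible mass.
    """
    width = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        y = lo + (i + 0.5) * width
        log_g = normal_logpdf(y, *truth)
        total += (normal_logpdf(y, *model) - log_g) * math.exp(log_g) * width
    return total

g = (0.0, 1.0)   # assumed true density
f = (0.2, 1.0)   # model f: slightly wrong mean, correct variance
h = (0.0, 2.0)   # model h: correct mean, wrong variance

I_f = information(f, g)
I_h = information(h, g)
J_fh = I_f - I_h  # equals (2), since log(f/g) - log(h/g) = log(f/h)
print(I_f, I_h, J_fh)
```

Both informations are negative and the one for f is closer to zero, so under these assumed densities f is the preferred model even though its mean is wrong, because h misstates the variance by more.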

A number of authors have proposed estimates of $J(f,h)$ (Akaike (1973),
Sawa (1978), Amemiya (1980) and Chow (1980)). They have proposed in
particular that the parameters $\theta$ and $\gamma$ be estimated via
maximum likelihood over the same sample that is used to estimate
$J(f,h)$. This makes the estimation of the relative information of the
two models somewhat awkward.
In particular, consider the natural estimate of $J(f,h)$, namely:

$$ \tilde J(f,h) = L_f - L_h \qquad (4) $$

where $L_f = \sum_{t=0}^{N} \log f(Y_t | Z_t, \tilde\theta)$

and $L_h = \sum_{t=0}^{N} \log h(Y_t | V_t, \tilde\gamma)$

where $\tilde\theta$ and $\tilde\gamma$ are the maximum likelihood
estimates of $\theta$ and $\gamma$ for the sample in which t goes from
zero to N. As is shown, for instance, by Chow (1980), on [...]
$Z_t, \theta$) over all possible realizations of the process g. This is
due to the following facts:

On the one hand, the maximum likelihood estimate $\tilde\theta$ leads in
a small sample almost surely to a larger value of log f than the vector
$\bar\theta$ which would result from estimating $\theta$ over a sample
which included infinite realizations of the process g. On the other
hand, the expected value of the likelihood in the population is larger
for $\bar\theta$ than for $\tilde\theta$. Finally, the expected value of
$\sum_{t=0}^{N} \log f(Y_t | Z_t, \bar\theta)$ is indeed equal to
$\sum_{t=0}^{N} \int_{-\infty}^{\infty} \log f(Y_t | Z_t, \bar\theta) \, g(Y_t | X_t) \, dY_t$.

Therefore:

$$ E_g L_f > \sum_{t=0}^{N} \int_{-\infty}^{\infty} \log f(Y_t | Z_t, \bar\theta) \, g(Y_t | X_t) \, dY_t > \sum_{t=0}^{N} \int_{-\infty}^{\infty} \log f(Y_t | Z_t, \tilde\theta) \, g(Y_t | X_t) \, dY_t $$

where $E_g L_f$ is the expectation of $L_f$ where, for each realization
of the process g, $\tilde\theta$ is computed to maximize $L_f$.

Hence, the in-sample values of the log likelihoods of f and h must be
corrected for their optimism. In particular, $L_f$ will tend to
overestimate the average $E_g \log f$ by a larger amount the more
parameters are estimated and the smaller is the number of observations.
Furthermore, the bias of $\tilde J(f,h)$ depends on the true model g.
Amemiya (1980) has shown that making three equally reasonable
assumptions about the relationship between f, h and g one obtains three
different estimates of $J(f,h)$.
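The optimism of the in-sample log likelihood can be checked by simulation. The sketch below assumes, purely for illustration, that g is standard normal and that f is a normal model whose mean and variance are both estimated by maximum likelihood on n = 20 observations; the sample size, the number of replications and all variable names are choices made here, not values from the text.

```python
import math
import random

random.seed(0)
n, reps = 20, 2000

# Expected log likelihood of n draws from g at the population-best
# parameters (mean 0, variance 1): the middle term of the inequality.
at_pseudo_true = -0.5 * n * (math.log(2 * math.pi) + 1.0)

in_sample_total = 0.0
population_total = 0.0
for _ in range(reps):
    y = [random.gauss(0.0, 1.0) for _ in range(n)]
    mu = sum(y) / n
    var = sum((v - mu) ** 2 for v in y) / n  # ML variance (divides by n)
    # Maximized in-sample log likelihood of the normal model.
    in_sample_total += -0.5 * n * (math.log(2 * math.pi * var) + 1.0)
    # Expected log likelihood of n fresh draws from g, evaluated at the
    # estimates: uses E (Y - mu)^2 = 1 + mu^2 when Y is standard normal.
    population_total += n * (
        -0.5 * math.log(2 * math.pi * var) - (1.0 + mu * mu) / (2.0 * var)
    )

avg_in_sample = in_sample_total / reps
avg_population = population_total / reps
print(avg_in_sample, at_pseudo_true, avg_population)
```

Under these assumptions the three printed numbers should appear in decreasing order: the average maximized likelihood sits above the population value at the best parameters, which in turn sits above the average population value at the estimates.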

Instead, I propose that the vectors $\theta$ and $\gamma$ be chosen
according to a different criterion than the one which maximizes the
probability that the model to which they correspond be selected as the
better model. In particular, the vectors $\hat\theta$ and $\hat\gamma$
could be the maximum likelihood estimates over a sample that did not
include the observations in which t ranges from zero to N.1/ Then one
could estimate $J(f,h)$ by

$$ \hat J(f,h) = \sum_{t=0}^{N} \left[ \log f(Y_t | Z_t, \hat\theta) - \log h(Y_t | V_t, \hat\gamma) \right] \qquad (5) $$

which is the estimate of the relative information of f and h that I
propose to use. Note that $\hat J(f,h)$ is an unbiased estimate of
$J(f,h)$ independently of the relation between f, h and g. This is, of
course, not true of the in-sample estimate $\tilde J(f,h)$ of (4), nor of
the corrected measures of relative information proposed by Amemiya
(1980), which are unbiased only if the maintained assumptions about the
relationship between f, h and g are true.
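A minimal sketch of how the out-of-sample estimate (5) might be computed follows. It assumes two hypothetical Gaussian models, a regression of y on x (model f) and a constant-mean model (model h), fitted by maximum likelihood on the first 100 simulated observations and evaluated on the remaining 200. The data-generating process, the split and the function names are illustrative assumptions.

```python
import math
import random

def normal_logpdf(y, mean, var):
    return -0.5 * math.log(2 * math.pi * var) - (y - mean) ** 2 / (2 * var)

def fit_regression(x, y):
    """Gaussian ML for y = a + b*x + e; returns (a, b, error variance)."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum(
        (xi - mx) ** 2 for xi in x
    )
    a = my - b * mx
    var = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y)) / n
    return a, b, var

def fit_constant(y):
    """Gaussian ML for y = m + e; returns (m, variance)."""
    n = len(y)
    m = sum(y) / n
    return m, sum((yi - m) ** 2 for yi in y) / n

def j_hat(x_eval, y_eval, f_params, h_params):
    """Sum of log f - log h over the evaluation sample, as in (5)."""
    a, b, var_f = f_params
    m, var_h = h_params
    return sum(
        normal_logpdf(yi, a + b * xi, var_f) - normal_logpdf(yi, m, var_h)
        for xi, yi in zip(x_eval, y_eval)
    )

random.seed(1)
x = [random.gauss(0.0, 1.0) for _ in range(300)]
y = [1.0 + xi + random.gauss(0.0, 1.0) for xi in x]  # assumed true process

# Parameters come from observations that are NOT used in the sum.
f_params = fit_regression(x[:100], y[:100])
h_params = fit_constant(y[:100])
estimate = j_hat(x[100:], y[100:], f_params, h_params)
print(estimate)
```

Because the parameters are fixed before the evaluation observations are seen, no correction for the number of estimated parameters enters the sum.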

I now want to discuss the significance of the estimate $\hat J(f,h)$
for the selection of econometric models. First, suppose that the
distribution of $[\log f(Y_t | Z_t, \hat\theta) - \log h(Y_t | V_t, \hat\gamma)]$
is not independent of the index t. Then, while $\hat J(f,h)$ is a valid
estimate of the relative quality of the models f and h, it is a rather
poor one. In fact, it corresponds to estimating the difference between
the means of two random variables by the difference between one
observation of each of them. Furthermore, $\hat J(f,h)$ will not
necessarily select the model which will be the best forecasting tool for
next period since, by then, the distribution of $[\log f - \log h]$ will
have changed once again. Note that this problem is present whether one
estimates the difference in information between f and h with in-sample
data or not.

For any estimate of $J(f,h)$ to be useful in the selection of
forecasting devices, one must think that the relative quality of the
two models does not vary too much over time. In particular, I will
assume that the distribution of
$[\log f(Y_t | Z_t, \hat\theta) - \log h(Y_t | V_t, \hat\gamma)]$ does
not change with t.2/ This will automatically be satisfied if g, f and h
describe iid variates, an assumption made by White (1980) and Chow
(1980), or if $X_t$, $Z_t$ and $V_t$ are iid random variables, an
assumption similar to one made by Amemiya (1980). Under the assumption
that $[\log f(Y_t | Z_t, \hat\theta) - \log h(Y_t | V_t, \hat\gamma)]$ is
iid with variance $\sigma_I^2$ one can test the hypothesis that the two
models have the same information or, in other words, that
$E_g[\log f(Y_t | Z_t, \hat\theta)] = E_g[\log h(Y_t | V_t, \hat\gamma)]$.
This is done by comparing a simple function of $\hat J(f,h)$ to the
limiting distribution of this function of $\hat J(f,h)$ under the above
assumption.3/ This limiting distribution is given by Prop. 1.

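Prop. 1.  If $[\log f(Y_t | Z_t, \hat\theta) - \log h(Y_t | V_t, \hat\gamma)]$
is iid with variance $\sigma_I^2 \neq 0$ and $J(f,h,t) = 0$, then the
test statistic $T_I$ given by:

$$ T_I = \frac{N^{-1/2} \sum_{t=0}^{N} \left[ \log f(Y_t | Z_t, \hat\theta) - \log h(Y_t | V_t, \hat\gamma) \right]}{\left\{ N^{-1} \sum_{t=0}^{N} \left[ \log f(Y_t | Z_t, \hat\theta) - \log h(Y_t | V_t, \hat\gamma) \right]^2 \right\}^{1/2}} $$

has asymptotic distribution N(0,1).
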


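Proof.  By the Lindeberg-Levy central limit theorem (c.f. Rao (1973)
pp. 128)

$$ \frac{N^{-1/2} \sum_{t=0}^{N} \left[ \log f(Y_t | Z_t, \hat\theta) - \log h(Y_t | V_t, \hat\gamma) \right]}{\sigma_I} $$

has asymptotic distribution N(0,1) under the stated assumptions.
Furthermore, if the mean of $[\log f(Y_t) - \log h(Y_t)]$ is zero, then
the raw second moment of the observations:

$$ \frac{1}{N} \sum_{t=0}^{N} \left[ \log f(Y_t | Z_t, \hat\theta) - \log h(Y_t | V_t, \hat\gamma) \right]^2 $$

converges in probability to $\sigma_I^2$ (c.f. Rao (1973) pp. 437).
Replacing $\sigma_I$ in the first expression by the square root of this
consistent estimate therefore leaves the limiting distribution
unchanged.

Q.E.D.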

Therefore, if, in a two-tailed test, $T_I$ turns out to be significantly
bigger than zero, one can conclude that model f has more information
than model h, provided their relative information is not time varying.
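The statistic of Prop. 1 reduces to a few lines once the per-observation differences log f - log h are available. The sketch below is one possible implementation under that assumption; the function names and the example differences are hypothetical, and the code divides by the number of observations supplied (N + 1 in the notation of the text), which is immaterial asymptotically.

```python
import math

def information_test(diffs):
    """T_I of Prop. 1: scaled sum of the differences over the square root
    of their raw (uncentered) second moment."""
    n = len(diffs)
    numerator = sum(diffs) / math.sqrt(n)
    raw_second_moment = sum(d * d for d in diffs) / n
    return numerator / math.sqrt(raw_second_moment)

def two_sided_p_value(t):
    """Two-tailed p-value under the N(0,1) limit: 2 * (1 - Phi(|t|))."""
    return math.erfc(abs(t) / math.sqrt(2.0))

# Hypothetical out-of-sample values of log f - log h.
diffs = [0.4, -0.1, 0.3, 0.2, -0.2, 0.5, 0.1, 0.3]
T = information_test(diffs)
print(T, two_sided_p_value(T))
```

The denominator uses the raw second moment rather than the sample variance because, as in the proof, the mean of the differences is zero under the null hypothesis.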

III.   Two  Other  Tests 

The information criterion does not just penalize models whose
conditional mean is far away from the realized observations but also
penalizes models which have a poor estimate of the variance of Y. Often,
in econometrics, one is interested in models which simply produce good
point forecasts, that is, whose conditional means are not too distant
from the typical realizations. Therefore, one may prefer a model whose
prediction errors are smaller even if its information is lower.

Denote by $\bar f_t$ and $\bar h_t$ the point forecasts for time t made
using model f and model h respectively:

$$ \bar f_t = \int_{-\infty}^{\infty} Y_t \, f(Y_t | Z_t, \hat\theta) \, dY_t $$

$$ \bar h_t = \int_{-\infty}^{\infty} Y_t \, h(Y_t | V_t, \hat\gamma) \, dY_t $$

There are various criteria by which the forecasting ability of two
models can be ranked. I will consider two types of such criteria. The
first is a generalization of the mean squared forecasting error (MSE)
criterion. The best model according to this criterion is the one with
the lowest value for the expectation of $e_t' Q e_t$, where $e_t$ is the
difference between $Y_t$ and the vector forecasted by the model while Q
is a positive definite weighting matrix. The second criterion I will
consider is a generalization of the mean absolute forecasting error
(MAE) criterion. According to this criterion the best model is the one
for which the expected value of $q'|e_t|$ is lowest. Here $|e_t|$ is the
vector whose elements are the absolute values of the forecast errors
while q is a vector with positive entries.

It is clear that it is easy to compute, for two models, the actual
difference between their post-sample average squared forecast errors and
between their post-sample average absolute forecast errors. Once again
these statistics will bear a strong relation to the relative forecasting
ability of the two models only if this relative forecasting ability is
not very influenced by changes in t. I will assume that, in fact, the
relative predictive accuracy is independent of time and derive a test of
this proposition together with the hypothesis that the forecasting
ability of the two models is the same.


In particular, I will assume sequentially that
$[(Y_t - \bar f_t)' Q (Y_t - \bar f_t) - (Y_t - \bar h_t)' Q (Y_t - \bar h_t)]$
and $q'[\,|Y_t - \bar f_t| - |Y_t - \bar h_t|\,]$ are independently
identically distributed and that their mean is zero to derive the test
statistics given by Prop. 2 and Prop. 3.


Prop. 2.  If $[(Y_t - \bar f_t)' Q (Y_t - \bar f_t) - (Y_t - \bar h_t)' Q (Y_t - \bar h_t)]$
is iid with variance $\sigma_s^2 \neq 0$ and

$$ E_g[(Y_t - \bar f_t)' Q (Y_t - \bar f_t)] = E_g[(Y_t - \bar h_t)' Q (Y_t - \bar h_t)] $$

then

$$ T_s = \frac{N^{-1/2} \sum_{t=0}^{N} \left[ (Y_t - \bar f_t)' Q (Y_t - \bar f_t) - (Y_t - \bar h_t)' Q (Y_t - \bar h_t) \right]}{\left\{ N^{-1} \sum_{t=0}^{N} \left[ (Y_t - \bar f_t)' Q (Y_t - \bar f_t) - (Y_t - \bar h_t)' Q (Y_t - \bar h_t) \right]^2 \right\}^{1/2}} $$

is asymptotically distributed N(0,1).

The proof follows the lines of the proof of Prop. 1.

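Prop. 3.  If $q'[\,|Y_t - \bar f_t| - |Y_t - \bar h_t|\,]$ is iid with
variance $\sigma_a^2 \neq 0$ and

$$ E_g[q'|Y_t - \bar f_t|] = E_g[q'|Y_t - \bar h_t|] $$

then

$$ T_a = \frac{N^{-1/2} \sum_{t=0}^{N} q'\left[ |Y_t - \bar f_t| - |Y_t - \bar h_t| \right]}{\left\{ N^{-1} \sum_{t=0}^{N} \left\{ q'\left[ |Y_t - \bar f_t| - |Y_t - \bar h_t| \right] \right\}^2 \right\}^{1/2}} $$

is asymptotically distributed N(0,1).

The proof follows the lines of the proof of Prop. 1.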

Once again, under the assumption that the difference in the predictive
power of the two models is independent over time, a very large value
for $T_s$ or $T_a$ leads to the conclusion that model h is preferable to
model f.
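The two forecast-based statistics can be sketched in the same way as the information test. The code below assumes vector observations stored as plain lists, an identity weighting matrix Q and unit weights q; the data, the weights and the function names are invented for illustration.

```python
import math

def quad_form(e, Q):
    """e' Q e for a list e and a square matrix Q given as a list of rows."""
    size = len(e)
    return sum(e[i] * Q[i][j] * e[j] for i in range(size) for j in range(size))

def weighted_abs(e, q):
    """q' |e| for a list e and a list of positive weights q."""
    return sum(qi * abs(ei) for qi, ei in zip(q, e))

def ratio_statistic(diffs):
    """Scaled sum over the square root of the raw second moment."""
    n = len(diffs)
    return (sum(diffs) / math.sqrt(n)) / math.sqrt(sum(d * d for d in diffs) / n)

def forecast_tests(Y, f_bar, h_bar, Q, q):
    """Return (T_s, T_a) from observations and the two sets of point forecasts."""
    sq_diffs, abs_diffs = [], []
    for y, f, h in zip(Y, f_bar, h_bar):
        e_f = [yi - fi for yi, fi in zip(y, f)]
        e_h = [yi - hi for yi, hi in zip(y, h)]
        sq_diffs.append(quad_form(e_f, Q) - quad_form(e_h, Q))
        abs_diffs.append(weighted_abs(e_f, q) - weighted_abs(e_h, q))
    return ratio_statistic(sq_diffs), ratio_statistic(abs_diffs)

# Hypothetical two-dimensional observations and forecasts.
Y = [[1.0, 1.0]] * 4
f_bar = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
h_bar = [[1.0, 1.0], [1.0, -1.0], [1.0, 1.0], [-1.0, 1.0]]
Q = [[1.0, 0.0], [0.0, 1.0]]
q = [1.0, 1.0]

T_s, T_a = forecast_tests(Y, f_bar, h_bar, Q, q)
print(T_s, T_a)
```

In this made-up example model h is exactly right half the time and badly wrong otherwise, so the squared-error statistic is negative (favoring f) while the absolute-error statistic is zero: the two criteria need not agree.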

IV.   Time Varying Models

So far, in this note, the vectors $\theta$ and $\gamma$ were taken as
fixed. However, one might well be interested in comparing the quality of
two models whose parameters are allowed to vary over time in a
predetermined manner. In particular, when evaluating the forecasting
ability of two models, various authors (Fair (1980), Litterman (1980))
have studied the ability of models to predict the t-th observation when
all observations up to the (t-1)-th are used to obtain estimates of
$\theta$ and $\gamma$. Geisser and Eddy (1979) propose instead to use
all observations except the i-th to compute the parameters which are
used to forecast the i-th observation.

Let $f(Y_t | Z_t, \theta_t)$ and $h(Y_t | V_t, \gamma_t)$ be the models
whose quality one wants to compare and $g(Y_t | X_t)$ be the true model.
Then, if $\theta_t$ and $\gamma_t$ do not depend on the realization
$Y_t$:

$$ \hat J(f,h,t) = \log f(Y_t | Z_t, \theta_t) - \log h(Y_t | V_t, \gamma_t) $$

is an unbiased estimate of

$$ J(f,h,t) = \int_{-\infty}^{\infty} \log\left( \frac{f(Y_t | Z_t, \theta_t)}{h(Y_t | V_t, \gamma_t)} \right) g(Y_t | X_t) \, dY_t $$

Furthermore the sum of the $\hat J(f,h,t)$ over t is an unbiased
estimate of the relative information of the two models in the sample in
which t goes from zero to N.

If one further assumes that
$[\log f(Y_t | Z_t, \theta_t) - \log h(Y_t | V_t, \gamma_t)]$ is iid with
variance $\sigma_I^2 \neq 0$, then one can use a statistic of the form of
$T_I$ with $\theta$ replaced by $\theta_t$ and $\gamma$ replaced by
$\gamma_t$ to test the hypothesis that the two models have the same
information.

Furthermore, letting:

$$ \bar f_t = \int_{-\infty}^{\infty} Y_t \, f(Y_t | Z_t, \theta_t) \, dY_t $$

and

$$ \bar h_t = \int_{-\infty}^{\infty} Y_t \, h(Y_t | V_t, \gamma_t) \, dY_t $$

one can use Prop. 2 and Prop. 3 to test whether the forecasting ability
of the two models is the same as long as one assumes that the
distribution of their relative forecasting ability is time invariant.
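One way the recursive scheme described above might be carried out is sketched below: at each t the parameters are re-estimated from observations 0 through t-1 only, so they do not depend on the realization being scored. The two normal models (f with a free mean, h with the mean fixed at zero), the simulated data, the burn-in length and the function names are all assumptions made for illustration, and the iid requirement on the differences is simply taken as given, as in the text.

```python
import math
import random

def normal_logpdf(y, mean, var):
    return -0.5 * math.log(2 * math.pi * var) - (y - mean) ** 2 / (2 * var)

def recursive_log_score_diffs(y, burn_in):
    """log f - log h at each t >= burn_in, with parameters estimated by
    maximum likelihood on observations 0, ..., t-1 only."""
    diffs = []
    for t in range(burn_in, len(y)):
        past = y[:t]
        mean_f = sum(past) / t
        var_f = sum((v - mean_f) ** 2 for v in past) / t
        var_h = sum(v * v for v in past) / t  # model h fixes the mean at zero
        diffs.append(
            normal_logpdf(y[t], mean_f, var_f) - normal_logpdf(y[t], 0.0, var_h)
        )
    return diffs

def ratio_statistic(diffs):
    """Statistic of the form of T_I applied to the recursive differences."""
    n = len(diffs)
    return (sum(diffs) / math.sqrt(n)) / math.sqrt(sum(d * d for d in diffs) / n)

random.seed(2)
y = [1.0 + random.gauss(0.0, 1.0) for _ in range(320)]  # assumed true mean of 1
diffs = recursive_log_score_diffs(y, burn_in=20)
t_stat = ratio_statistic(diffs)
print(len(diffs), t_stat)
```

A leave-one-out variant in the spirit of Geisser and Eddy (1979) would replace `past = y[:t]` by all observations other than the t-th.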

Footnotes

1/ This begs the question of how many of the observations should be used
to estimate the models and how many should be used to ascertain their
relative validity. This arbitrariness, however, is of a different order
than the arbitrariness of the in-sample estimates of the difference in
information between the two models. This is so because certain estimates
of this difference are systematically more favorable to bigger models
than others, while the arbitrariness involved in the selection of a size
for the post-sample does not permit a systematic bias in favor of a
particular class of models.

2/ One could also assume that this distribution depended on time in an
ad hoc manner. However, the advantages of this procedure are not clear
to me. Fair (1980) does in fact provide a method for evaluating
econometric models in which he allows their quality to change over time.
However, this procedure does not at this point have a firm statistical
foundation.

3/ White (1980) derives a similar distribution which can be applied to a
variety of estimates of $J(f,h)$ which use the same data as is used to
generate $\theta$ and $\gamma$. However, in medium-sized samples, the
significance level of his test will depend crucially on which of the
a priori equally good in-sample estimates of $J(f,h)$ is employed.